16 research outputs found

A graph-search framework for associating gene identifiers with documents

Author: A Yeh
AM Cohen
AM Cohen
AM Cohen
C Zhai
Consortium TGO
D Hanisch
E Hatcher
E Minkov
E Minkov
Einat Minkov
F Sha
J Crim
K Franzén
K Fundel
K Humphreys
L Hirschman
L Hirschman
M Collins
M Craven
R Bunescu
RI Kondor
T Rindflesch
U Leser
William W Cohen
WW Cohen
WW Cohen
WW Cohen
Y Altun
Y Freund
Z Kou
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: One step in the model organism database curation process is to find, for each article, the identifier of every gene discussed in the article. We consider a relaxation of this problem suitable for semi-automated systems, in which each article is associated with a ranked list of possible gene identifiers, and experimentally compare methods for solving this geneId-ranking problem. In addition to baseline approaches based on combining named entity recognition (NER) systems with a "soft dictionary" of gene synonyms, we evaluate a graph-based method which combines the outputs of multiple NER systems, as well as other sources of information, and a learning method for reranking the output of the graph-based method. RESULTS: We show that NER systems with similar F-measure performance can have significantly different performance when used with a soft dictionary for geneId-ranking. The graph-based approach can outperform any of its component NER systems, even without learning, and learning can further improve the performance of the graph-based ranking approach. CONCLUSION: The utility of an NER system for geneId-finding may not be accurately predicted by its entity-level F1 performance, the most common performance measure. GeneId-ranking systems are best implemented by combining several NER systems. With appropriate combination methods, usefully accurate geneId-ranking systems can be constructed based on easily available resources, without resorting to problem-specific, engineered components.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A new pairwise kernel for biological network inference with support vector machines

Author: A Ben-Hur
A Ramani
B Schölkopf
C Harbison
C von Mering
E Sprinzak
E Xing
EM Marcotte
F Pazos
GD Bader
GRG Lanckriet
GS Kimeldorf
HW Mewes
IW Tsang
Jean-Philippe Vert
Jian Qiu
JP Vert
KQ Weinberger
N Aronszajn
N Friedman
P Pavlidis
R Jansen
RI Kondor
S Boyd
S Martin
SF Altschul
SM Gomez
VN Vapnik
William S Noble
WK Huh
Y Qi
Y Yamanishi
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
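For illustration, the two kinds of pairwise kernel contrasted in this abstract can be sketched as below. This is a minimal sketch rather than the authors' code: the abstract gives no formulas, so the expressions used are the forms in which the metric learning pairwise kernel (MLPK) and the tensor product pairwise kernel (TPPK, assumed here to be the "other pairwise kernel") are commonly stated, and the function names and the way pairs are passed are hypothetical.

```python
import numpy as np

def _split(pairs):
    """Turn a list of (i, j) node-index pairs into two index arrays."""
    first, second = np.array(pairs).T
    return first, second

def mlpk(K, pairs_x, pairs_y):
    """Metric learning pairwise kernel between two lists of node pairs.

    K is a base kernel (Gram) matrix over nodes. Assumed form for pairs
    (a, b) and (c, d): (K[a,c] - K[a,d] - K[b,c] + K[b,d]) ** 2.
    """
    a, b = _split(pairs_x)
    c, d = _split(pairs_y)
    diff = (K[np.ix_(a, c)] - K[np.ix_(a, d)]
            - K[np.ix_(b, c)] + K[np.ix_(b, d)])
    return diff ** 2

def tppk(K, pairs_x, pairs_y):
    """Tensor product pairwise kernel (assumed form):
    K[a,c] * K[b,d] + K[a,d] * K[b,c]."""
    a, b = _split(pairs_x)
    c, d = _split(pairs_y)
    return (K[np.ix_(a, c)] * K[np.ix_(b, d)]
            + K[np.ix_(a, d)] * K[np.ix_(b, c)])

def combined(K, pairs_x, pairs_y):
    """Sum of both pairwise kernels; a sum of kernels is again a kernel."""
    return mlpk(K, pairs_x, pairs_y) + tppk(K, pairs_x, pairs_y)
```

The resulting Gram matrix over pairs can be handed to any SVM implementation that accepts a precomputed kernel, which is the sense in which the abstract says the kernel "can easily be used by most SVM implementations".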

Crossref

Springer - Publisher Connector

PubMed Central

HAL Descartes

HAL-MINES ParisTech

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

Author: A Su
B Brancotte
B Calvo
B Linghu
B Liu
B Schölkopf
B Schölkopf
B Schölkopf
C Giallourakis
C Perez-Iratxeta
C Son
CC Chang
EA Adie
F Denis
F Mordelet
Fantine Mordelet
FS Turner
G Lanckriet
GRG Lanckriet
J Freudenberg
Jean-Philippe Vert
K Bleakley
K Lage
L Jacob
L Jacob
LC Tranchevent
M van Driel
N López-Bigas
N Tiffin
O Vanunu
P Pavlidis
RI Kondor
S Aerts
S Köhler
S Yu
T De Bie
T Evgeniou
T Hwang
U Ala
V McKusick
X Wu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. RESULTS: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. CONCLUSIONS: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Enhanced protein fold recognition through a novel data integration approach

Author: A Andreeva
A Rakotomamonjy
AL Yuille
B Schölkopf
C Ding
CA Micchelli
CE Rasmussen
Colin Campbell
DT Jones
F Bach
F Bach
GRG Lanckriet
GRG Lanckriet
HB Shen
HW Mewes
I Dubchak
J Shawe-Taylor
J Ye
J Ye
JM Borwein
JV Davis
K Bleakley
K Chou
K Tsuda
Kaizhu Huang
L Liao
L Lo Conte
L Sun
L Vandenberghe
M Girolami
N Aronszajn
N Cristianini
ND Lawrence
PD Tao
R Hettich
RI Kondor
S Amari
S Ji
S Sonnenburg
T Damoulas
T Hastie
T Kato
Y Lin
Y Nesterov
Y Yamanishi
Y Ying
Yiming Ying
Publication venue: BioMed Central
Publication date: 01/01/2009

BACKGROUND: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources. RESULTS: In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multi-class classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence objective, there are two formulations which we respectively refer to as MKLdiv-dc and MKLdiv-conv. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem. CONCLUSION: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, which is a more than 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report a competitive performance on the yeast protein function prediction problem.
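As a rough illustration of a KL-divergence objective between kernel matrices, the sketch below treats each kernel matrix as the covariance of a zero-mean Gaussian and evaluates the standard closed-form KL divergence between the two Gaussians. That interpretation, the ridge term added for invertibility, and all function names are assumptions made for the example; the abstract does not spell out the objective, and no claim is made here about which argument order corresponds to MKLdiv-dc or MKLdiv-conv.

```python
import numpy as np

def gaussian_kl(A, B):
    """KL( N(0, A) || N(0, B) ) for symmetric positive definite A and B:
    0.5 * ( trace(B^-1 A) - log det(B^-1 A) - n )."""
    n = A.shape[0]
    M = np.linalg.solve(B, A)            # B^-1 A without forming the inverse
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - logdet - n)

def combined_kernel(kernels, weights, ridge=1e-6):
    """Convex combination of input kernels plus a small ridge (assumption)
    so that the result is strictly positive definite."""
    weights = np.asarray(weights, dtype=float)
    K = sum(w * Ki for w, Ki in zip(weights, kernels))
    return K + ridge * np.eye(K.shape[0])

def kl_objectives(kernels, weights, K_out):
    """Both argument orders of the divergence for a given weight vector."""
    K_in = combined_kernel(kernels, weights)
    return gaussian_kl(K_out, K_in), gaussian_kl(K_in, K_out)
```

Because the divergence is not symmetric, the two argument orders give different objectives over the kernel weights, which is why the abstract distinguishes two formulations with different solvers.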

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Explore Bristol Research

Candidate gene prioritization by network analysis of differential expression using machine learning approaches

Author: A Subramanian
A Zanzoni
AJ Smola
AP Francisco
B Aranda
B Harr
Bart de Moor
C Saunders
C Stark
C von Mering
D Nitsch
D Zieker
Daniela Nitsch
F Chung
F Fouss
Fabian Ojeda
GC Cawley
GD Bader
H Yang
HY Chuang
J Chen
JA Hanley
Joana P Gonçalves
JW Park
K Lage
KR Brown
L Franke
L Gautier
L Salwinski
LC Tranchevent
M Liu
P Baldi
P Pagel
R Gupta
RA Irizarry
RI Kondor
RK Nibbe
S Aerts
S Köhler
S Mirkin
S Razick
S Vardhanabhuti
SE Choe
T Fawcett
WK Lim
Y Saad
Yves Moreau
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: Discovering novel disease genes is still challenging for diseases for which no prior knowledge - such as known disease genes or disease-related pathways - is available. Performing genetic studies frequently results in large lists of candidate genes of which only a few can be followed up for further investigation. We have recently developed a computational method for constitutional genetic disorders that identifies the most promising candidate genes by replacing prior knowledge with experimental data of differential gene expression between affected and healthy individuals. To improve the performance of our prioritization strategy, we have extended our previous work by applying different machine learning approaches that identify promising candidate genes by determining whether a gene is surrounded by highly differentially expressed genes in a functional association or protein-protein interaction network. RESULTS: We have proposed three strategies scoring disease candidate genes relying on network-based machine learning approaches, such as kernel ridge regression, heat kernel, and Arnoldi kernel approximation. For comparison purposes, a local measure based on the expression of the direct neighbors is also computed. We have benchmarked these strategies on 40 publicly available knockout experiments in mice, and performance was assessed against results obtained using a standard procedure in genetics that ranks candidate genes based solely on their differential expression levels (Simple Expression Ranking). Our results showed that our four strategies could outperform this standard procedure and that the best results were obtained using the Heat Kernel Diffusion Ranking, leading to an average ranking position of 8 out of 100 genes, an AUC value of 92.3% and an error reduction of 52.8% relative to the standard procedure approach, which ranked the knockout gene on average at position 17 with an AUC value of 83.7%. CONCLUSION: In this study we could identify promising candidate genes using network-based machine learning approaches even if no knowledge is available about the disease or phenotype.
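The reported error reduction of 52.8% is consistent with the two AUC values if the error is read as 1 − AUC. That reading is an assumption, since the abstract does not define the term:

```latex
\[
\frac{(1 - 0.837) - (1 - 0.923)}{1 - 0.837}
  = \frac{0.163 - 0.077}{0.163}
  = \frac{0.086}{0.163}
  \approx 0.528
\]
```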

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Network Analysis of Differential Expression for the Identification of Disease-Causing Genes

Author: AM Yip
Bernard Thienpont
C Moehle
C von Mering
Daniela Nitsch
DB Mount
DN Cox
EH Rosenberg
EK Malmberg
F Fouss
FJ Probst
FR Bach
FR Bach
Gustavo Goldman
H Parkinson
Hilde Van Esch
HY Chuang
J Johnson
JM Wright
JR Riordan
K Kyo
K Lage
Koenraad Devriendt
L Bubendorf
L Franke
Lieven Thorrez
Léon-Charles Tranchevent
M Bakay
M Cortón
M Simoni
M Urbanek
M Urbanek
MR Jones
N Kotaja
P Moretti
PE Becker
RI Kondor
S Aerts
S Draghici
S Fine
S Franks
S Ina
S Kuramochi-Miyagawa
S Köhler
SS Tanaka
T Barrett
T Noce
T Watanabe
TK Gandhi
Y Nishimura
Yves Moreau
Z Yao
Publication venue: Public Library of Science
Publication date: 01/05/2009

Genetic studies (in particular linkage and association studies) identify chromosomal regions involved in a disease or phenotype of interest, but those regions often contain many candidate genes, only a few of which can be followed up for biological validation. Recently, computational methods to identify (prioritize) the most promising candidates within a region have been proposed, but they are usually not applicable to cases where little is known about the phenotype (no or few confirmed disease genes, fragmentary understanding of the biological cascades involved). We seek to overcome this limitation by replacing knowledge about the biological process with experimental data on differential gene expression between affected and healthy individuals. Considering the problem from the perspective of a gene/protein network, we assess a candidate gene by considering the level of differential expression in its neighborhood under the assumption that strong candidates will tend to be surrounded by differentially expressed neighbors. We define a notion of soft neighborhood where each gene is given a contributing weight, which decreases with the distance from the candidate gene on the protein network. To account for multiple paths between genes, we define the distance using the Laplacian exponential diffusion kernel. We score candidates by aggregating the differential expression of neighbors weighted as a function of distance. Through a randomization procedure, we rank candidates by p-values. We illustrate our approach on four monogenic diseases and successfully prioritize the known disease-causing genes.
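The scoring scheme described in this abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the kernel is taken to be exp(-beta * L) with L the unnormalized graph Laplacian, the candidate's own expression is excluded from its score, the randomization is a simple permutation of expression values across genes, and the parameter names (`beta`, `n_perm`, `seed`) are hypothetical.

```python
import numpy as np

def diffusion_kernel(adjacency, beta=0.5):
    """Laplacian exponential diffusion kernel exp(-beta * L) of an
    undirected graph, computed via the eigendecomposition of L."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A               # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)         # L is symmetric
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

def neighborhood_scores(K, diff_expr):
    """Kernel-weighted sum of the neighbors' differential expression.
    Zeroing the diagonal (an assumption) drops each gene's own value."""
    W = K.copy()
    np.fill_diagonal(W, 0.0)
    return W @ np.asarray(diff_expr, dtype=float)

def permutation_pvalues(K, diff_expr, n_perm=1000, seed=0):
    """Empirical p-value per gene: how often a score obtained from
    shuffled expression values reaches the observed score."""
    rng = np.random.default_rng(seed)
    x = np.asarray(diff_expr, dtype=float)
    observed = neighborhood_scores(K, x)
    hits = np.zeros_like(observed)
    for _ in range(n_perm):
        hits += neighborhood_scores(K, rng.permutation(x)) >= observed
    return (1.0 + hits) / (1.0 + n_perm)
```

Candidates within a region would then be ranked by ascending p-value. Using the kernel instead of shortest-path distance means that genes connected to the candidate through many paths contribute more than genes reachable through a single path of the same length.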

Lirias

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Disease-Aging Network Reveals Significant Roles of Aging Genes in Connecting Genetic Diseases

Author: A Budovsky
A Budovsky
A Friedman
A Kowald
A Kriete
A Ozgur
AL Barabasi
C Soti
D Harman
David B. Searls
DJ Watts
E Ravasz
G Jin
GRG Lanckriet
H Kitano
H Xue
HD Osiewacz
HJ Kiss
I Feldman
J Hasty
JDJ Han
Jiguang Wang
JP de Magalhaes
JP de Magalhaes
JR Managbanag
KI Goh
L Hayflick
Luonan Chen
M Wolfson
MEJ Newman
P Shannon
P Zuppan
PF Jonsson
Q Cui
R Albert
R Bell
RI Kondor
S Karni
S Maere
S Maslov
S Peri
S Vasto
Shihua Zhang
T Ideker
T Ishunina
TBL Kirkwood
U Brandes
U Stelzl
X Jiang
X Wu
Xiang-Sun Zhang
Y Li
Yong Wang
Z Spiro
Z Tu
Publication venue: Public Library of Science
Publication date: 01/09/2009

One of the challenging problems in biology and medicine is exploring the underlying mechanisms of genetic diseases. Recent studies suggest that the relationship between genetic diseases and the aging process is important in understanding the molecular mechanisms of complex diseases. Although some intricate associations have been investigated for a long time, the studies are still in their early stages. In this paper, we construct a human disease-aging network to study the relationship among aging genes and genetic disease genes. Specifically, we integrate human protein-protein interactions (PPIs), disease-gene associations, aging-gene associations, and physiological system–based genetic disease classification information in a single graph-theoretic framework and find that (1) human disease genes are much closer to aging genes than expected by chance; and (2) diseases can be categorized into two types according to their relationships with aging. Type I diseases have their genes significantly close to aging genes, while type II diseases do not. Furthermore, we examine the topological characteristics of the disease-aging network from a systems perspective. Theoretical results reveal that the genes of type I diseases are in a central position of a PPI network while those of type II diseases are not; (3) more importantly, we define an asymmetric closeness based on the PPI network to describe relationships between diseases, and find that aging genes make a significant contribution to associations among diseases, especially among type I diseases. In conclusion, the network-based study provides not only evidence for the intricate relationship between the aging process and genetic diseases, but also biological implications for probing the nature of human diseases.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data

Author: A Kuzniar
A Vazquez
Aalt D. J. van Dijk
AJ Enright
C Moler
Cajo J. F. ter Braak
CJF Ter Braak
CJF Ter Braak
CM Federovitch
DJC MacKay
GD Bader
GR Lanckriet
H Lee
I Kosmidis
I Ulitsky
Iddo Friedberg
IM Cheeseman
J Besag
JA Hanley
L Milligan
L Peña Castillo
M Ashburner
M Deng
M Deng
M Punta
Marco C. A. M. Bink
N Nariai
NJ Mulder
P McCullagh
R Sharan
RI Kondor
Roeland C. H. J. van Ham
S Ferré
S Geman
S Letovsky
S Mostafavi
SF Altschul
SR Collins
SZ Li
T Gabaldon
U Karaoz
V Vethantham
XL Chen
Y Chen
Y Guan
Yiannis A. I. Kourmpetis
Z Barutcuoglu
Z Wei
Publication venue: Public Library of Science
Publication date: 01/01/2010

Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that the protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Fields method. We tested our method using a high quality S. cerevisiae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but can also be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we evaluate the predictions using the available literature.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Wageningen University & Research Publications

Support vector machines

Author: Cristianini N
De Bie T
Fletcher R
Jaakkola T
Joachims T
Kashima H
Kondor RI
Lanckriet GR
Schölkopf B
Vapnik VN
Vert JP
Vinokourov A
Publication venue: Wiley
Publication date: 01/01/2009

Support vector machines (SVMs) are a family of machine learning methods, originally introduced for the problem of classification and later generalized to various other situations. They are based on principles of statistical learning theory and convex optimization, and are currently used in various domains of application, including bioinformatics, text categorization, and computer vision.

Crossref

Archivio della ricerca - Fondazione Bruno Kessler

Exploiting multi-context analysis in semantic image classification

Author: A Vailaya
B Schölkopf
D Cai
D Cai
D Jensen
DG Li
G Siolas
Gao Wen
H Yu
Huang Tie-jun
J Kandola
J Mohr
J Neville
N Cristianini
R Lempel
R Zhao
RI Kondor
S Deerwester
S Paek
T Gärtner
T Joachims
Tian Yong-hong
WY Ma
XJ Wang
Y Yang
YH Tian
YH Tian
Z Chen
Publication venue: Zhejiang University Press

Crossref